RecursionError: maximum recursion depth exceeded caused by a Python Queue object

Someone online wanted to download Singapore PSI data from April 2014 to the present, from the Historical PSI Readings page. The page looked simple enough, so I wrote a small crawler as practice.

Running it raised a RecursionError:

Traceback (most recent call last):
  File "C:\Python 3.5\lib\multiprocessing\queues.py", line 241, in _feed
    obj = ForkingPickler.dumps(obj)
  File "C:\Python 3.5\lib\multiprocessing\reduction.py", line 50, in dumps
    cls(buf, protocol).dump(obj)
RecursionError: maximum recursion depth exceeded

Recursion depth exceeded? But my code doesn't use recursion anywhere. Digging through the source of queues.py, it turns out that multiprocessing.Queue starts a feeder thread named QueueFeederThread the first time put() is called, and this thread dispatches the queued data. Before sending each item, it serializes it to bytes with ForkingPickler.dumps(obj), and that is where the RecursionError is raised. I spent half an hour on the calls below that without making sense of them, so I set that aside for now.
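To see the mechanism in isolation: pickling any sufficiently deeply nested object hits the same wall. A minimal sketch with a hand-built nested list (not the actual objects my crawler puts on the queue), as it behaves on the Python 3.5 used here:

import pickle

# A list nested about 1100 levels deep -- past the default recursion limit of 1000.
deep = node = []
for _ in range(1100):
    node.append([])
    node = node[0]

pickle.dumps(deep)  # RecursionError: maximum recursion depth exceeded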

A round of Googling turned up that Python has a default limit on recursion depth:

>>> import sys
>>> sys.getrecursionlimit()
1000

Raise it to something larger, say one million:

sys.setrecursionlimit(1000000)

Running the script again, the error was gone.
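Re-running the little sketch from above with the higher limit shows why: the pickler can now recurse all the way down. Again just an assumed toy structure, on the same Python 3.5:

import pickle
import sys

sys.setrecursionlimit(1000000)  # same call as in the crawler below

deep = node = []
for _ in range(1100):
    node.append([])
    node = node[0]

data = pickle.dumps(deep)  # completes now that the limit is high enough
print(len(data))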

The code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Author: loveNight

import os
import sys
import csv
import time
import requests
import threading
from multiprocessing import Queue
from datetime import datetime, timedelta
from bs4 import BeautifulSoup as BS
from multiprocessing.dummy import Pool

sys.setrecursionlimit(1000000)  # raise the recursion limit (default is only 1000)
os.chdir(sys.path[0])

url_pattern = r"http://www.nea.gov.sg/anti-pollution-radiation-protection/air-pollution-control/psi/historical-psi-readings/year/{0}/month/{1}/day/{2}"
# CSV header
table_header = ["Year", "Month", "Day", "Time", "North",
                "East", "West", "Central", "Overall Singapore"]

headers = {
    "Accept-Encoding": "gzip,deflate,sdch",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
    "Host": "www.nea.gov.sg",
}
session = requests.Session()
session.headers = headers
delay = 0  # pause between network requests, in seconds

QUIT = "Quit"
queue = Queue()

# Dates to download
dt = datetime(2014, 4, 1)
dt_now = datetime.now()
todo = []
while dt <= dt_now:
    todo.append(dt)
    dt += timedelta(days=1)


# Fetch a page
def getPage(url):
    if delay:
        time.sleep(delay)
    return session.get(url).text


# Writer thread: pull rows off the queue and append them to the CSV file
def save(filename):
    start = time.time()
    with open(filename, "w", newline="") as output:
        writer = csv.writer(output)
        writer.writerow(table_header)
        while True:
            lines = queue.get()
            if isinstance(lines, str) and lines == QUIT:
                break
            else:
                print("Got data, writing", datetime.now())
                writer.writerows(lines)
    print("Finished writing! Took %s seconds" % (time.time() - start))


# Parse the page for a given date
def resolvePage(dt):
    year = dt.year
    month = dt.month
    day = dt.day
    html = getPage(url_pattern.format(year, month, day))
    soup = BS(html, "lxml")  # needs the third-party lxml package; the built-in html.parser works too
    table = soup.find(name="table", class_="text_psinormal")
    if table:
        trs = table.find_all("tr")
        trs = trs[2:]  # drop the header rows
        lines = []
        for tr in trs:
            datas = [year, month, day] + [x for x in tr.strings if x != "\n"]
            lines.append(datas)
        queue.put(lines)  # put the whole table on the queue


# Start downloading
filename = "data.csv"
t = threading.Thread(target=save, args=(filename,))
t.daemon = True
t.start()

pool = Pool(30)
pool.map(resolvePage, todo)
pool.close()
pool.join()

queue.put(QUIT)
t.join()  # wait for the writer thread to drain the queue before exiting

The download took 127 seconds.

The resulting .csv file can be opened in Excel.
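If you prefer to inspect it in code rather than in Excel, here is a quick look with pandas (assuming pandas is installed; data.csv is the file name used in the script above):

import pandas as pd

df = pd.read_csv("data.csv")
print(df.head())   # first few hourly readings
print(df.shape)    # (rows, columns)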
